R DATA VISUALIZATION
Why Data Visualization?
Raw data points alone don’t provide much insight when you kick off a data
analysis.
Data visualization is useful for:
* exploring patterns in the data quickly at the early stage
* enhancing the storytelling of a data analysis in the final conclusion
(infographics play a huge role here)
Reminder on Workflow Again
Built-in Plot Functions
The advantage of the built-in plotting utilities is that they are easy to use.
They let you quickly visualize patterns in the data while you are trying to
gain a first insight at the early stage of your workflow.

Grammar of Graphics: ggplot2
If these built-in plotting tools are not enough for you, go to
ggplot2, the most popular data visualization package for R.
ggplot2 is an open-source package that breaks graphs up into semantic
components such as scales and layers. Since 2005, ggplot2 has grown in
use to become one of the most popular R packages.
BASIC GRAMMAR
ggplot2 is based on the grammar of graphics, the idea that you can
build every graph from the same components:
a data set + a coordinate system +
geoms (visual marks that represent data points)
Use Built-in Datasets
Let’s use the built-in cars data set for a simple ggplot

geom_point() function
It’s easy to add a geometry layer to the base coordinate system.
Let’s ADD a layer of data points using the geom_point()
function.
You ADD a layer with the + operator. geom_point() requires an x and
a y value for each point: in a 2D coordinate system, a point is
described by its x and y values.
We need to provide a mapping that specifies which data columns supply
each point’s x and y values.
That mapping is defined by the aesthetics function
aes()
A scatterplot is useful for exploring the relationship between two variables.

Use geom_line() to replace geom_point()
geom_point() and geom_line() take very similar parameters.
geom_line() connects the observations with a line, in order of the
x variable.

Use geom_smooth() to project a smooth line
Again, geom_smooth() and geom_point() take very similar
parameters.
geom_smooth() fits a smoothed trend line through the data.
cars %>% ggplot() +
geom_smooth(mapping = aes(x=speed, y=dist))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Adding Multiple Layers
# add both layers to the same plot by chaining them with +
cars %>% ggplot() +
geom_point(mapping = aes(x=speed, y=dist)) +
geom_smooth(mapping = aes(x=speed, y=dist))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Adding Aesthetics to your Plots
Un-comment the extra parameters to add more aesthetics to your
plot
cars %>% ggplot() +
geom_point(mapping = aes(x=speed, y=dist),
color = "orange", # the color of data points
# size = 3, # the size of data point
# alpha = 0.5, # the transparency of data points, min=0, max=1
# shape = 0, # the shape of data point
)

PLOT WITH OUR OWN DATA
Loading Data: allowance & graduates
Allowance Data Set in Simple Scatterplot

Continuous Values vs. Discrete Values
Continuous values are numbers that can take any value within a wide range,
e.g. salary or height. Discrete values are limited to a small set of valid
values. They can be strings or a few distinct numbers.
When you produce plots, pay attention to which type of value each geom
requires.
In many cases, you will need to convert the data first.
The mutate() function is often used for that.
Example:
allowance = allowance %>%
mutate(Assessment_Year = as.numeric(substr(Assessment_Year, 1 ,4)))
Simple Line Plot

Adding Multiple Layers of Geometry
allowance %>% ggplot() +
geom_line(mapping = aes(x=Assessment_Year, y=Basic, group=1), color="orange") +
geom_point(mapping = aes(x=Assessment_Year, y=Basic, group=1), color="orange")

# As both geoms use the same data mapping, the above statements can be simplified as
allowance %>% ggplot(aes(x=Assessment_Year, y=Basic, group=1)) +
geom_line(color="orange") +
geom_point(color="orange", size=5)

Use geom_smooth() to smooth out the line
allowance %>% ggplot(aes(x=Assessment_Year, y=Basic, group=1)) +
geom_smooth(color="orange") +
geom_point(color="orange", size=5)
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Save a Plot: ggsave()
ggsave("./output/my_first_plot.png") # default image size
Saving 7 x 7 in image
Bar Chart with geom_col()

geom_bar()
geom_bar() counts how often each observed value occurs.
It is usually used for variables with a limited set of values.

CHALLENGE: line plot for hibor_fixing_1m
library(jsonlite) # load package
hkma.interbank.url = "https://api.hkma.gov.hk/public/market-data-and-statistics/daily-monetary-statistics/daily-figures-interbank-liquidity"
interbank.liquidity = fromJSON(hkma.interbank.url)
# the above retrieval will take a while. The server response is slow.
summary(interbank.liquidity)
Length Class Mode
header 3 -none- list
result 2 -none- list
str(interbank.liquidity)
List of 2
$ header:List of 3
..$ success : logi TRUE
..$ err_code: chr "0000"
..$ err_msg : chr "No error found"
$ result:List of 2
..$ datasize: int 100
..$ records :'data.frame': 100 obs. of 44 variables:
.. ..$ end_of_date : chr [1:100] "2022-03-30" "2022-03-29" "2022-03-28" "2022-03-25" ...
.. ..$ cu_weakside : num [1:100] 7.85 7.85 7.85 7.85 7.85 7.85 7.85 7.85 7.85 7.85 ...
.. ..$ cu_strongside : num [1:100] 7.75 7.75 7.75 7.75 7.75 7.75 7.75 7.75 7.75 7.75 ...
.. ..$ disc_win_base_rate : num [1:100] 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 ...
.. ..$ hibor_overnight : num [1:100] 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.03 0.02 ...
.. ..$ hibor_fixing_1m : num [1:100] 0.313 0.312 0.313 0.323 0.319 ...
.. ..$ twi : num [1:100] 95.3 95.7 95.9 95.6 95.7 95.7 95.6 95.5 95.4 95.4 ...
.. ..$ opening_balance : int [1:100] 337534 337534 337534 337534 337534 337536 337553 337538 337538 337538 ...
.. ..$ closing_balance : int [1:100] 337551 337534 337534 337534 337534 337534 337536 337553 337538 337538 ...
.. ..$ market_activities : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ interest_payment : chr [1:100] "+16" "+0" "+0" "+0" ...
.. ..$ discount_window_reversal : chr [1:100] "-0" "-0" "-0" "-0" ...
.. ..$ discount_window_activities : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ intraday_movements_of_aggregate_balance_at_0930: int [1:100] 363880 350696 357104 354934 363276 355552 353301 352138 351573 353926 ...
.. ..$ intraday_movements_of_aggregate_balance_at_1000: int [1:100] 366723 358279 358428 358128 363476 361781 356632 356870 353300 355233 ...
.. ..$ intraday_movements_of_aggregate_balance_at_1100: int [1:100] 383598 376635 375870 374775 374465 377616 374645 370373 371865 373477 ...
.. ..$ intraday_movements_of_aggregate_balance_at_1200: int [1:100] 331056 383952 385132 383777 383416 323375 375169 381656 378485 395569 ...
.. ..$ intraday_movements_of_aggregate_balance_at_1500: int [1:100] 334817 391180 389282 382202 390063 335473 391970 403349 403099 398017 ...
.. ..$ intraday_movements_of_aggregate_balance_at_1600: int [1:100] 337197 392468 391754 383194 391833 337396 392436 403610 409198 401337 ...
.. ..$ forex_trans_t1 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ other_market_activities_t1 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ reversal_of_discount_window_t1 : chr [1:100] "-0" "-0" "-0" "-0" ...
.. ..$ interest_payment_issuance_efbn_t1 : chr [1:100] "+0" "+16" "+0" "+0" ...
.. ..$ forecast_aggregate_bal_t1 : int [1:100] 337551 337551 337534 337534 337534 337534 337534 337536 337536 337538 ...
.. ..$ forex_trans_t2 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ other_market_activities_t2 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ reversal_of_discount_window_t2 : chr [1:100] "-0" "-0" "-0" "-0" ...
.. ..$ interest_payment_issuance_efbn_t2 : chr [1:100] "+0" "+0" "+7" "+0" ...
.. ..$ forecast_aggregate_bal_t2 : int [1:100] 337551 337551 337541 337534 337534 337534 337534 337529 337536 337536 ...
.. ..$ forex_trans_t3 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ other_market_activities_t3 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ reversal_of_discount_window_t3 : chr [1:100] "-0" "-0" "-0" "-0" ...
.. ..$ interest_payment_issuance_efbn_t3 : chr [1:100] "+0" "+0" "+0" "+7" ...
.. ..$ forecast_aggregate_bal_t3 : int [1:100] 337551 337551 337541 337541 337534 337534 337534 337529 337529 337536 ...
.. ..$ forex_trans_t4 : chr [1:100] NA NA NA NA ...
.. ..$ other_market_activities_t4 : chr [1:100] NA NA NA NA ...
.. ..$ reversal_of_discount_window_t4 : chr [1:100] NA NA NA NA ...
.. ..$ interest_payment_issuance_efbn_t4 : chr [1:100] NA NA NA NA ...
.. ..$ forecast_aggregate_bal_t4 : int [1:100] NA NA NA NA NA NA NA NA NA NA ...
.. ..$ forex_trans_u : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ other_market_activities_u : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ reversal_of_discount_window_u : chr [1:100] "-0" "-0" "-0" "-0" ...
.. ..$ interest_payment_issuance_efbn_u : chr [1:100] "-72" "-72" "-62" "-62" ...
.. ..$ forecast_aggregate_bal_u : int [1:100] 337479 337479 337479 337479 337479 337479 337479 337479 337479 337479 ...
interbank.liquidity$result
$datasize
[1] 100
$records
str(interbank.liquidity$result)
List of 2
$ datasize: int 100
$ records :'data.frame': 100 obs. of 44 variables:
interbank.records = interbank.liquidity$result$records %>% as_tibble()
interbank.records
interbank.records %>%
mutate(end_of_date = as.Date(end_of_date)) %>% # convert the text dates to Date for a proper time axis
ggplot() +
geom_line(
mapping=aes(x=end_of_date, y=hibor_fixing_1m),
color="orange"
)

GROUPING AND AGGREGATION
Using group_by() and summarise()
graduates %>% group_by(AcademicYear, LevelOfStudy) %>%
summarise(TotalHeadcount = sum(Headcount)) %>%
ggplot(
aes(x=AcademicYear,
y=TotalHeadcount,
group=LevelOfStudy,
color=LevelOfStudy
)
) +
geom_line() +
geom_point()
`summarise()` has grouped output by 'AcademicYear'. You can override using the `.groups`
argument.

Use of filter()
Use filter() to keep only the “Taught Postgraduate” records.
Without first applying filter(), group_by() and summarise(), this plot
is not very useful.

filter() + group_by() + summarise()
Use filter() to extract the required rows. Use group_by() and summarise()
to group the rows and aggregate the total headcount for both male and female.
graduates %>%
filter(LevelOfStudy=="Taught Postgraduate") %>%
group_by(AcademicYear, ProgrammeCategory) %>%
summarise(TotalHeadcount = sum(Headcount)) %>%
ggplot() +
geom_line(mapping=aes(x=AcademicYear,y=TotalHeadcount, group=ProgrammeCategory, color=ProgrammeCategory))
`summarise()` has grouped output by 'AcademicYear'. You can override using the `.groups`
argument.

graduates %>%
filter(LevelOfStudy=="Undergraduate") %>%
group_by(AcademicYear, ProgrammeCategory) %>%
summarise(TotalHeadcount = sum(Headcount)) %>%
ggplot() +
geom_line(mapping=aes(x=AcademicYear,y=TotalHeadcount, group=ProgrammeCategory, color=ProgrammeCategory))
`summarise()` has grouped output by 'AcademicYear'. You can override using the `.groups`
argument.

geom_col() function

More Aggregation Functions
Center: mean(), median()
Spread: sd(), IQR(), mad()
Range: min(), max(), quantile()
Position: first(), last(), nth()
Count: n(), n_distinct()
Logical: any(), all()
More information is available in the documentation of the summarise()
function.
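
A minimal sketch of a few of these functions inside summarise(), using the built-in mtcars data (an assumption for illustration, not one of the course data sets). The column names on the left of each `=` are made up.

```{r}
library(dplyr)

cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    cars     = n(),        # Count: number of rows in each group
    mean_mpg = mean(mpg),  # Center
    sd_mpg   = sd(mpg),    # Spread
    max_mpg  = max(mpg)    # Range
  )

cyl_summary
```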
geom_bar() function
A bar chart shows the frequency count (the number of records in the data
set) for each value.

box plot
The boxplot compactly displays the distribution of a continuous
variable. It visualises five summary statistics (the median, two hinges
and two whiskers), and all “outlying” points individually.
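
As a sketch on the built-in mtcars data (an assumption for illustration), geom_boxplot() draws one box per group, and layer_data() exposes the five summary statistics behind each box. The name `p_box` is illustrative.

```{r}
library(ggplot2)

# One box of fuel consumption (mpg) per number of cylinders
p_box <- ggplot(mtcars) +
  geom_boxplot(mapping = aes(x = factor(cyl), y = mpg))

p_box

# ymin/ymax are the whisker ends, lower/upper the hinges, middle the median
layer_data(p_box)[, c("ymin", "lower", "middle", "upper", "ymax")]
```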


MAKE IT PRETTY
Use of title, label, background color and themes
# in this example we save the plot to a variable name 'level.bar.plot' so that we can use it again and again.
level.bar.plot = graduates %>%
filter(ProgrammeCategory=="Engineering and Technology") %>%
ggplot() +
geom_col(mapping=aes(x=AcademicYear, y=Headcount, fill=LevelOfStudy))
# To show the plot, just use print() function with the previous saved plot variable as parameter.
print(level.bar.plot)

Plot Background
element_rect() is a function that generates a rectangle
element. Specify the fill
parameter as a string containing a color name or a hex code.
The plot background refers to the whole area containing everything that
belongs to the plot.

Panel Background
The panel background refers to the inner area of the plot. The areas showing
the title, axis labels and legend are NOT included.


level.bar.plot # default style
level.bar.plot +
theme(panel.background = element_rect(fill="orange")) # styling the panel background

Remove Plot and Panel Background
In visual design, color is a very powerful tool for guiding users’
attention. But you have to use it carefully.
Too many colors will usually do the opposite and confuse the audience.
Minimal design is the recent trend, especially now that many people use
small devices like mobile phones for day-to-day communication.
In this example, we remove both the plot and panel backgrounds to
achieve a clean design. After all, the background is not the main dish. Very
often a background color distracts from the graph.

Change Label for x/y Axis
level.bar.plot # default style

level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Students") + # Label for Y axis
xlab("Year") # Label for X axis

Rotate the Label Text
level.bar.plot # default style

level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Students") + # Label for Y axis
xlab("Year") + # Label for X axis
theme(axis.text.x = element_text(angle = 45))

Change Fill Colors
level.bar.plot # default style

level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Students") + # Label for Y axis
xlab("Year") + # Label for X axis
scale_fill_manual(values=c("purple", "orange", "blue", "tomato"))

Styling The Legends
level.bar.plot # default style

level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Students") + # Label for Y axis
xlab("Year") + # Label for X axis
theme(legend.position="top") +
scale_fill_manual(values=c("purple", "orange", "blue", "tomato"),
guide = guide_legend(title="Level of Study",
label.position = "bottom")
)

Add Title and Subtitle
level.bar.plot # default style

level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Students") + # Label for Y axis
xlab("Year") + # Label for X axis
theme(legend.position="top") +
scale_fill_manual(values=c("purple", "orange", "blue", "tomato"),
guide = guide_legend(title="Level of Study",
label.position = "bottom")
) + # move legend position to top and label position to bottom
ggtitle("Hong Kong Higher Education Student Headcount", subtitle="2009 - 2019")

Add Annotations Texts
Add extra text or shapes to enhance your visualization
level.bar.plot # default style

level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Students") + # Label for Y axis
xlab("Year") + # Label for X axis
theme(legend.position="top") +
scale_fill_manual(values=c("purple", "orange", "blue", "tomato"),
guide = guide_legend(title="Level of Study",
label.position = "bottom")
) + # move legend position to top and label position to bottom
ggtitle("Hong Kong Higher Education Student Headcount", subtitle="2009 - 2019") +
annotate("text", label="Record\nHigh", x="2017/18", y=5300) # you can change value of x and y to set the text position

Adding Reference Lines

Using Themes
level.bar.plot # default style

level.bar.plot +
theme_bw() # black and white theme

level.bar.plot +
theme_minimal() # minimal theme

level.bar.plot +
theme_dark() # dark theme

More 3rd-party Themes
Install the ggthemes package to unlock a wider selection
of themes.
if (!require("pacman")) install.packages("pacman") # check if pacman already installed. If not, install it.
pacman::p_load(ggthemes)
level.bar.plot # default style

level.bar.plot +
theme_excel() + # Excel Theme
ggtitle("Excel Theme")

level.bar.plot +
theme_wsj() + # Wall Street Journal Theme
ggtitle("Wall Street Journal Theme")

level.bar.plot +
theme_economist() + # Economist Theme
ggtitle("Economist Theme")

level.bar.plot +
theme_fivethirtyeight() + # Five Thirty Eight
ggtitle("Five Thirty Eight Theme")

EXPLORING & ANALYZING WITH MODELS
Data science combines the efforts and results of programming,
mathematics and domain expertise. Of these, mathematics is the
foundation of models. With models, data scientists make predictions,
discover hidden patterns and draw insights.
Modeling is usually an iterative process that cycles through data
transformation, data visualization, exploring with models and fitting.
What exactly is a Model?
Humans are good at drawing conclusions and providing insight, but are
NOT good at directly handling a large number of data attributes and a huge
volume of raw data.
A model is a mathematical expression that provides a simple
low-dimensional summary of a data set so that we can
draw conclusions and even provide insights. Models only provide
an approximation (NOT the exact truth).
Basic Concepts of Model
Let’s do some simple R coding to uncover the basic concept of
model
if (!require("pacman")) install.packages("pacman") # install pacman
pacman::p_load(pacman, tidyverse, modelr, magrittr) # install (or load) required packages
Let’s use a simple data set, sim1 (from the modelr package), for
exploring. In this simulated data you can clearly see the pattern with
the help of a simple scatterplot.
print(sim1)
ggplot(sim1, aes(x, y)) +
geom_point()

Generating a Random Linear Model
A linear model is widely used to explore the relationship between two variables.
A linear model is described as y = a1 + x * a2
Let’s generate a random value of a1 as the intercept and
a2 as the slope. Here, we use runif() to
generate a uniformly distributed random number.
model = tibble(
a1 = runif(1, -20, 40), # random intercept value between -20 and 40
a2 = runif(1, -5, 5) # random slope value between -5 and 5
)
print(model)
ggplot(sim1, aes(x,y)) +
geom_point() +
geom_abline(aes(intercept = a1, slope = a2), data=model, color="Orange")

Generating 250 Random Models as Candidate Models
The number of potential models is unlimited. Let’s generate
250 random ones as candidates.
Among these 250 models, some are clearly bad even at a glance.
Some are not bad, but we don’t know which one is the best among them.
models = tibble(
a1 = runif(250, -20, 40), # 250 random intercept values between -20 and 40
a2 = runif(250, -5, 5) # 250 random slope values between -5 and 5
)
ggplot(sim1, aes(x,y)) +
geom_point() +
geom_abline(aes(intercept = a1, slope = a2), data=models, alpha=0.2)

Selecting the Most Fitting Ten Models
# this function calculates the modeled y value for each observed x value
modeled_y = function(a, data) {
a[1] + data$x * a[2] # a[1] is the intercept and a[2] is the slope
}
# this function measures the overall distance between the observed and modeled y values
# as the root-mean-squared deviation (one number per model)
measure_distance = function(mod, data) {
diff <- data$y - modeled_y(mod, data) # mod holds the intercept and slope of one candidate model
sqrt(mean(diff ^ 2))
}
# this helper calculates that distance on sim1 for a model with a1 as intercept and a2 as slope
sim1_dist = function(a1, a2) {
measure_distance(c(a1, a2), sim1)
}
# use map2_dbl() to call sim1_dist for each (a1, a2) pair and store the result in a new column named 'dist'
models %<>%
mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = a1, slope = a2, color = -dist) ,
data = filter(models, rank(dist) <= 10) # To show only the best 5, change 10 to 5
)

Using lm() function
In fact, we didn’t have to do all the previous complex coding. R makes
this much easier with a built-in function named
lm() (a linear model fitting function).
lm() finds the closest model in a single step,
using a sophisticated algorithm that involves geometry, calculus, and
linear algebra.
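
The prediction call that follows relies on two objects, sim1_auto_model and new.data, whose definitions are not shown here. A plausible reconstruction, consistent with the ten predicted values printed after it but still an assumption, is:

```{r}
library(modelr)  # provides the sim1 data set

# Fit y as a linear function of x; lm() estimates the intercept and slope
sim1_auto_model <- lm(y ~ x, data = sim1)
coef(sim1_auto_model)

# Assumed new observations: x = 1, 2, ..., 10
new.data <- data.frame(x = 1:10)
predict(sim1_auto_model, new.data)
```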
predict(sim1_auto_model, new.data)
1 2 3 4 5 6 7 8
6.272355 8.323888 10.375421 12.426954 14.478487 16.530020 18.581553 20.633087
9 10
22.684620 24.736153
Categorical Variable
Recoding Data
Scaling
Transforming Outliers
More on Model
unnest()
---
title: "R Intermediate - Day 2"
output: html_notebook
---

------------------------------------------------------------------------


# R DATA VISUALIZATION

## Why Data Visualization?
Raw data points alone don't provide much insight when you kick off a data analysis.\

Data visualization is useful for:

* exploring patterns in the data quickly at the early stage
* enhancing the storytelling of a data analysis in the final conclusion (infographics play a huge role here)

## Reminder on Workflow Again

![R Data Science Workflow](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png)

## Built-in Plot Functions
The advantage of the built-in plotting utilities is that they are easy to use.
They let you quickly visualize patterns in the data while you are trying to gain a first insight at the early stage of your workflow.
```{r}
plot(iris)
```

## Refresher on Built-in Plot Tools
For built-in R data visualization, go to the **R Programming Intro** project on GitHub to refresh your memory:
[R Intro Source Codes](https://github.com/ngsanluk/R-Intro)


## Grammar of Graphics: ggplot2

If these built-in plotting tools are not enough for you,\
go to **ggplot2**, the most popular data visualization package for R.

ggplot2 is an open-source package that breaks graphs up into semantic components such as **scales** and **layers**.\
Since 2005, ggplot2 has grown in use to become one of the most popular R packages.


## ggplot2 Cheat Sheet

[ggplot 2 cheat sheet](https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf)

------------------------------------------------------------------------

# BASIC GRAMMAR
ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same components below: 

**a data set** 
+ 
**a coordinate system** 
+ 
**and geoms—visual marks that represent data points**

![Grammar of Graphics](https://jules32.github.io/r-for-excel-users/img/rstudio-cheatsheet-ggplot.png)


## Use Built-in Datasets
Let's use the built-in cars data set for a simple ggplot

```{r}
library(tidyverse) # loads ggplot2, dplyr, readr and the %>% pipe used throughout this notebook
print(cars)
```

```{r Generating Empty Plot}
cars %>% ggplot() # This only specifies a data set and a coordinate system and therefore an empty plot
```
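
As a rough sketch of the grammar in action, the chunk below builds the same plot in stages and counts its layers. The names `base` and `with_points` are illustrative only, and inspecting `$layers` is just a convenient way to peek inside the plot object.

```{r}
library(ggplot2)

# Data set + coordinate system only: nothing is drawn yet
base <- ggplot(data = cars)
length(base$layers)          # number of geom layers so far

# Adding a geom with + appends one layer to the plot
with_points <- base + geom_point(mapping = aes(x = speed, y = dist))
length(with_points$layers)

with_points
```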

## geom_point() function
It's easy to add a geometry layer to the base coordinate system.\
Let's ADD a layer of data points using the **geom_point()** function.

You ADD a layer with the + operator.
**geom_point()** requires an x and a y value for each point:
in a 2D coordinate system, a point is described by its x and y values.

We need to provide a mapping that specifies which data columns supply each point's x and y values.

That mapping is defined by the aesthetics function\
**aes()**

A scatterplot is useful for exploring the relationship between two variables.

```{r}
cars %>% ggplot() +
  geom_point(mapping = aes(x=speed, y=dist))
```

## Use geom_line() to replace geom_point()
geom_point() and geom_line() take very similar parameters.\
geom_line() connects the observations with a line, in order of the x variable.

```{r}
# just change geom_point to geom_line without changing anything else
cars %>% ggplot() +
  geom_line(mapping = aes(x=speed, y=dist)) 
```
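
To see what "connecting the points" means in practice, here is a small sketch on made-up data (the data frame `zigzag` and its values are assumptions for illustration). `geom_line()` joins observations in order of `x`, while `geom_path()` joins them in the order they appear in the data.

```{r}
library(ggplot2)

# Made-up data, deliberately not sorted by x
zigzag <- data.frame(x = c(3, 1, 4, 2), y = c(30, 10, 40, 20))

# geom_line(): points are connected from the smallest x to the largest x
p_line <- ggplot(zigzag, aes(x = x, y = y)) +
  geom_line() +
  geom_point()

# geom_path(): points are connected in row order (3 -> 1 -> 4 -> 2)
p_path <- ggplot(zigzag, aes(x = x, y = y)) +
  geom_path() +
  geom_point()

p_line
p_path
```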


## Use geom_smooth() to project a smooth line
Again, geom_smooth() and geom_point() take very similar parameters.\
geom_smooth() fits a smoothed trend line through the data.

```{r}
# just change geom_point to geom_smooth without changing anything else
cars %>% ggplot() +
  geom_smooth(mapping = aes(x=speed, y=dist)) 
```
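
By default `geom_smooth()` picks a smoothing method for you and names it in a message (loess in the output for this data). As a hedged sketch, the chunk below requests a straight-line fit with `method = "lm"` instead and hides the confidence band with `se = FALSE`; whether a straight line suits your data is a judgment call. The name `cars_fit` is illustrative.

```{r}
library(ggplot2)

# Straight-line (linear model) smoother instead of the default curve
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The same straight line can be checked numerically with lm()
cars_fit <- lm(dist ~ speed, data = cars)
coef(cars_fit)  # intercept and slope of the fitted line
```
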
## Adding Multiple Layers
```{r}
# add both layers to the same plot by chaining them with +
cars %>% ggplot() +
  geom_point(mapping = aes(x=speed, y=dist)) +
  geom_smooth(mapping = aes(x=speed, y=dist)) 
```

## Adding Aesthetics to your Plots
Un-comment the extra parameters to add more aesthetics to your plot
```{r}
cars %>% ggplot() +
  geom_point(mapping = aes(x=speed, y=dist),
             color = "orange", # the color of data points
             # size = 3, # the size of data point
             # alpha = 0.5, # the transparency of data points, min=0, max=1
             # shape = 0, # the shape of data point
             )
```
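
One detail worth spelling out: an aesthetic placed inside `aes()` is mapped to data, while one placed outside is set to a fixed value. The sketch below contrasts the two on the `cars` data; `p_set` and `p_mapped` are illustrative names.

```{r}
library(ggplot2)

# Setting: color is outside aes(), so every point is drawn in orange
p_set <- ggplot(cars) +
  geom_point(mapping = aes(x = speed, y = dist), color = "orange")

# Mapping: "orange" inside aes() is treated as a data value (a category label),
# so ggplot2 chooses a color from its default palette and adds a legend
p_mapped <- ggplot(cars) +
  geom_point(mapping = aes(x = speed, y = dist, color = "orange"))

unique(layer_data(p_set)$colour)     # the literal color that was set
unique(layer_data(p_mapped)$colour)  # a palette color chosen by the scale
```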

------------------------------------------------------------------------

# PLOT WITH OUR OWN DATA

## Loading Data: allowance & graduates
```{r reading data files}
allowance = read_csv("./data/allowance.csv")
print(allowance)
```


## Allowance Data Set in Simple Scatterplot
```{r Allowance Scatterplot}
allowance %>% 
  ggplot() + 
  geom_point(
    mapping=aes(x=Assessment_Year, y=Basic),
    color = "orange",
    size = 3
    ) 
```


## Continuous Values vs. Discrete Values
Continuous values are numbers that can take any value within a wide range, e.g. salary or height.\
Discrete values are limited to a small set of valid values. They can be strings or a few distinct numbers.

When you produce plots, pay attention to which type of value each geom requires.

In many cases, you will need to convert the data first.
The **mutate()** function is often used for that.

Example:
```{r}
allowance = allowance %>% 
  mutate(Assessment_Year = as.numeric(substr(Assessment_Year, 1 ,4))) 
```
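
As a self-contained sketch of this conversion, the chunk below uses a tiny made-up table. The `"2019/20"` year format and all the amounts are assumptions for illustration, not values from allowance.csv.

```{r}
library(dplyr)

# Hypothetical stand-in for the allowance data
toy_allowance <- tibble(
  Assessment_Year = c("2017/18", "2018/19", "2019/20"),
  Basic = c(100, 110, 120)
)

# Keep the first four characters ("2017") and convert them to a number,
# turning a discrete text column into a continuous numeric one
toy_numeric <- toy_allowance %>%
  mutate(Assessment_Year = as.numeric(substr(Assessment_Year, 1, 4)))

class(toy_allowance$Assessment_Year)  # type before the conversion
class(toy_numeric$Assessment_Year)    # type after the conversion
```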


## Simple Line Plot
```{r Allowance Line Graph}

# If Assessment_Year were still a character (discrete) column, the following statement
# would NOT draw a line. After the numeric conversion above, it does.
allowance %>% ggplot() +
  geom_line(mapping = aes(x=Assessment_Year, y=Basic))

# For line graphs, the data points must be grouped so that ggplot2 knows which points to connect.
# In this case, all points should be connected, so group=1.
# When more variables are used and multiple lines are drawn, the grouping for lines is usually done by variable.
# color is set outside aes() so that the line is actually drawn in orange.
allowance %>% ggplot() +
  geom_line(mapping = aes(x=Assessment_Year, y=Basic, group=1), color="orange")
```
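
The grouping rule is easiest to see on a small made-up example (the `toy` data below is an assumption for illustration). With a discrete x, each x value forms its own group; `group = 1` places every row in one group so that a single line can be drawn.

```{r}
library(ggplot2)

toy <- data.frame(
  year  = c("2017/18", "2018/19", "2019/20"),  # character, so discrete
  value = c(100, 110, 120)
)

# One group per year: there is nothing to connect within a group
p_no_group <- ggplot(toy) +
  geom_line(mapping = aes(x = year, y = value))

# A single group: the three points are joined by one line
p_one_group <- ggplot(toy) +
  geom_line(mapping = aes(x = year, y = value, group = 1))

p_one_group
```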

## Adding Multiple Layers of Geometry
```{r}
allowance %>% ggplot() +
  geom_line(mapping = aes(x=Assessment_Year, y=Basic, group=1), color="orange") +
  geom_point(mapping = aes(x=Assessment_Year, y=Basic, group=1), color="orange")

# As both geoms use the same data mapping, the above statements can be simplified as
allowance %>% ggplot(aes(x=Assessment_Year, y=Basic, group=1)) +
  geom_line(color="orange") +
  geom_point(color="orange", size=5)
```

## Use geom_smooth() to smooth out the line
```{r}
allowance %>% ggplot(aes(x=Assessment_Year, y=Basic, group=1)) +
  geom_smooth(color="orange") +
  geom_point(color="orange", size=5)
```



## Save a Plot: ggsave()
```{r}
my.first.plot = allowance %>% 
  ggplot() + 
  geom_point(
    mapping=aes(x=Assessment_Year, y=Basic),
    color = "orange",
    size = 3
    ) 

print(my.first.plot)

dir.create("./output", showWarnings = FALSE) # make sure the output folder exists
ggsave("./output/my_first_plot.png", plot = my.first.plot) # default image size
ggsave("./output/my_first_plot_large.png", plot = my.first.plot, width=4000, height=2000, units="px")

```
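
A hedged variation on saving a plot: passing the plot object explicitly via `plot =` avoids relying on whichever plot was displayed last. The sketch writes to a temporary file (`out_file` and `demo_plot` are illustrative names) so it runs without an ./output folder.

```{r}
library(ggplot2)

demo_plot <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = "orange", size = 3)

# tempfile() gives a throwaway path; use your own path in a real project
out_file <- tempfile(fileext = ".png")

# Size is given in inches here; dpi controls the pixel density
ggsave(out_file, plot = demo_plot, width = 8, height = 4, units = "in", dpi = 150)

file.exists(out_file)
```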

## Bar Chart with geom_col()

```{r}
allowance %>% 
  ggplot() +
  geom_col(mapping=aes(x=Assessment_Year, y=Basic),
           fill="tomato") 
```




## geom_bar()
geom_bar() counts how often each observed value occurs.
It is usually used for a variable with a limited set of values.

```{r}
allowance %>% 
  ggplot() +
  geom_bar(mapping=aes(x=Basic))  
```
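
One way to see the difference between geom_bar() and geom_col() is to build the same chart both ways. This is a sketch on made-up data; `toy.votes` and its column are hypothetical.

```{r}
library(dplyr)
library(ggplot2)

# hypothetical data: one row per observation
toy.votes = tibble(Choice = c("A", "B", "A", "C", "A", "B"))

# geom_bar() counts the rows for each value of x by itself
p.bar = ggplot(toy.votes) +
  geom_bar(mapping = aes(x = Choice))

# the same chart built by hand: count first, then draw the counts with geom_col()
toy.counts = toy.votes %>% count(Choice)
p.col = ggplot(toy.counts) +
  geom_col(mapping = aes(x = Choice, y = n))

print(toy.counts)
print(p.bar)
```

Use geom_bar() when the data still has one row per observation, and geom_col() when the bar heights are already stored in a column.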

------------------------------------------------------------------------

# CHALLENGE
## Multiple Layers of Lines
Add a line plot for the Child column.
In the same plot, add another line plot for the Dependent_Parent_60 column.
```{r}
allowance %>% ggplot() +
  geom_line(mapping = aes(x=Assessment_Year, y=Child, group=1), color="Orange") + 
  geom_line(mapping = aes(x=Assessment_Year, y=Dependent_Parent_60, group=1), color="Blue") 
```

------------------------------------------------------------------------

# WORK WITH MORE COMPLEX DATA

## Loading Data: graduates.csv
```{r reading complex data}
graduates = read_csv("./data/graduates.csv")
print(graduates)
```

## Simple Scatterplot
Let's explore the data with some simple ggplot2 plots. 
They are not very informative yet; this is just quick exploration.

```{r}
ggplot(data=graduates) +
  geom_point(mapping=aes(x=AcademicYear, y=Headcount))

ggplot(data=graduates) +
  geom_point(mapping=aes(x=AcademicYear, y=Headcount, shape=Sex) # use shape to differentiate groups
             )

ggplot(data=graduates) +
  geom_point(mapping=aes(x=AcademicYear, y=Headcount, color=Sex) # use color to differentiate groups
             )
```

## Use filter() to Extract Required Rows
```{r use filter()}
graduates %>% 
filter(LevelOfStudy=="Undergraduate", ProgrammeCategory=="Business and Management") %>%
ggplot() +
  geom_point(mapping=aes(x=AcademicYear, y=Headcount, color=Sex)
)
```


## Use Line Plot To Explore the Trend 
Use a line plot to explore the trend of the "Business and Management" student headcount.


```{r}
library(magrittr)
graduates %<>%
  mutate(AcademicYear=as.factor(AcademicYear),
         Sex=as.factor(Sex)
         ) # convert the AcademicYear and Sex to factor type

graduates %>% 
filter(LevelOfStudy=="Undergraduate", ProgrammeCategory=="Business and Management") %>%
ggplot() +
  geom_line(
    mapping=aes(x=AcademicYear, 
                         y=Headcount,
                         group=Sex, 
                         color=Sex
                         )
             )

```

------------------------------------------------------------------------

# CHALLENGE
## Comparison with Line Plots
Use line plots to compare the headcount trend of female undergraduate students in the ProgrammeCategory values "Business and Management" and "Engineering and Technology".
Use filter() to extract the required records.
You can chain multiple filter() calls.
Use & or | to combine multiple conditions.


```{r Previewing ProgramCategory}
graduates %>% 
  .$ProgrammeCategory %>% 
  unique() # display the unique names of ProgrammeCategory

graduates %>% 
  filter(LevelOfStudy=="Undergraduate", Sex=="F") %>%
  filter(ProgrammeCategory=="Business and Management" | ProgrammeCategory=="Engineering and Technology") %>% 
  print() # Test extracting and printing the required records.
```
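
When the same column is compared against several values, `%in%` is a common alternative to chaining `|`. The sketch below uses made-up records; `toy.graduates` and `wanted` are hypothetical names.

```{r}
library(dplyr)

# hypothetical records for illustration
toy.graduates = tibble(
  ProgrammeCategory = c("Business and Management", "Education", "Engineering and Technology"),
  Headcount = c(10, 20, 30)
)

wanted = c("Business and Management", "Engineering and Technology")

# version 1: combine two comparisons with |
with.or = toy.graduates %>% 
  filter(ProgrammeCategory == "Business and Management" | ProgrammeCategory == "Engineering and Technology")

# version 2: %in% is TRUE when the value matches any entry of 'wanted'
with.in = toy.graduates %>% 
  filter(ProgrammeCategory %in% wanted)

print(with.or)
print(with.in)
```

Both versions keep the same rows; the `%in%` form stays short when more categories are added.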


```{r Business and Management vs. Engineering and Technology}
graduates %>% 
  filter(LevelOfStudy=="Undergraduate", Sex=="F") %>%
  filter(ProgrammeCategory=="Business and Management" | ProgrammeCategory=="Engineering and Technology") %>% 
  ggplot(
    aes(x=AcademicYear, 
             y=Headcount,
             group=ProgrammeCategory, 
             color=ProgrammeCategory
             )
    ) +
    geom_line() +
    geom_point() 
  
```

------------------------------------------------------------------------


# CHALLENGE: line plot for hibor_fixing_1m
```{r}
library(jsonlite) # load package
hkma.interbank.url = "https://api.hkma.gov.hk/public/market-data-and-statistics/daily-monetary-statistics/daily-figures-interbank-liquidity"
interbank.liquidity = fromJSON(hkma.interbank.url)
# the above retrieval will take a while.  The server response is slow.
summary(interbank.liquidity)
str(interbank.liquidity)
interbank.liquidity$result
str(interbank.liquidity$result)
interbank.records = interbank.liquidity$result$records %>% as_tibble()
interbank.records

interbank.records %>% 
ggplot() +
  geom_line(
    mapping=aes(x=end_of_date, y=hibor_fixing_1m, group=1),
    color="orange"
             )

```


------------------------------------------------------------------------

# GROUPING AND AGGREGATION

## Using group_by() and summarise() 
```{r}
graduates %>% group_by(AcademicYear, LevelOfStudy) %>% 
  summarise(TotalHeadcount = sum(Headcount)) 

graduates %>% group_by(AcademicYear, LevelOfStudy) %>% 
  summarise(TotalHeadcount = sum(Headcount)) %>% 
  ggplot(
    aes(x=AcademicYear, 
             y=TotalHeadcount,
             group=LevelOfStudy, 
             color=LevelOfStudy
             )
    ) +
    geom_line() +
    geom_point() 
  
```
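
To see what group_by() and summarise() do to the rows, it can help to run them on a few made-up records. This is a sketch: `toy.records` and its values are hypothetical and only mimic the column names of the real data.

```{r}
library(dplyr)

# hypothetical records: two rows (F and M) per academic year
toy.records = tibble(
  AcademicYear = c("2017/18", "2017/18", "2018/19", "2018/19"),
  LevelOfStudy = c("Undergraduate", "Undergraduate", "Undergraduate", "Undergraduate"),
  Sex = c("F", "M", "F", "M"),
  Headcount = c(10, 20, 30, 40)
)

toy.totals = toy.records %>% 
  group_by(AcademicYear, LevelOfStudy) %>% 
  summarise(TotalHeadcount = sum(Headcount), .groups = "drop") # .groups = "drop" returns an ungrouped tibble

print(toy.totals)
```

The four input rows collapse into one row per combination of AcademicYear and LevelOfStudy, and the Sex column disappears because it is neither a grouping column nor a summary.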

## Use of filter()
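Use filter() to keep only the "Taught Postgraduate" records.

With filter() alone this plot is not very useful: each programme category still has several rows per year (for example one for each sex), so the lines jump between them. The rows need to be aggregated with group_by() and summarise() first.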

```{r}
graduates %>% 
  filter(LevelOfStudy=="Taught Postgraduate") %>% 
  ggplot() +
    geom_line(mapping=aes(x=AcademicYear,y=Headcount, group=ProgrammeCategory, color=ProgrammeCategory))
```


## filter() + group_by() + summarise()
Use filter() to extract the required rows.
Use group_by() and summarise() to group the rows and aggregate the total headcount of both male and female students.
```{r ggplot line}

graduates %>% 
  filter(LevelOfStudy=="Taught Postgraduate") %>% 
  group_by(AcademicYear, ProgrammeCategory) %>% 
  summarise(TotalHeadcount = sum(Headcount)) %>% 
  ggplot() +
    geom_line(mapping=aes(x=AcademicYear,y=TotalHeadcount, group=ProgrammeCategory, color=ProgrammeCategory))

# Following is the same chart for "undergraduate" 
# graduates %>%
#   filter(LevelOfStudy=="Undergraduate") %>%
#   group_by(AcademicYear, ProgrammeCategory) %>%
#   summarise(TotalHeadcount = sum(Headcount)) %>%
#   ggplot() +
#     geom_line(mapping=aes(x=AcademicYear,y=TotalHeadcount, group=ProgrammeCategory, color=ProgrammeCategory))
    
```



## geom_col() function

```{r}
LevelOfStudy = graduates %>% .$LevelOfStudy %>% unique()
ProgrammeCategory = graduates %>% .$ProgrammeCategory %>% unique() 
print(LevelOfStudy)
print(ProgrammeCategory)
  
graduates %>% 
filter(ProgrammeCategory=="Business and Management") %>% 
ggplot() +
  geom_col(mapping=aes(x=AcademicYear, y=Headcount, fill=LevelOfStudy))

graduates %>% 
filter(ProgrammeCategory=="Engineering and Technology") %>% 
ggplot() +
  geom_col(mapping=aes(x=AcademicYear, y=Headcount, fill=LevelOfStudy))

```

## More Aggregation Functions
* Center: mean(), median()
* Spread: sd(), IQR(), mad()
* Range: min(), max(), quantile()
* Position: first(), last(), nth()
* Count: n(), n_distinct()
* Logical: any(), all()

More information at
[summarise() function](https://dplyr.tidyverse.org/reference/summarise.html)
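
Several of these functions can be combined in a single summarise() call. The sketch below uses made-up data; `toy.scores` and the summary column names are hypothetical.

```{r}
library(dplyr)

# hypothetical scores for two teams
toy.scores = tibble(
  Team = c("A", "A", "A", "B", "B"),
  Score = c(1, 2, 6, 4, 8)
)

toy.summary = toy.scores %>% 
  group_by(Team) %>% 
  summarise(
    Mean = mean(Score),                 # center
    Median = median(Score),             # center
    Max = max(Score),                   # range
    First = first(Score),               # position
    Rows = n(),                         # count of rows in the group
    DistinctScores = n_distinct(Score), # count of different values
    AnyAbove5 = any(Score > 5)          # logical
  )

print(toy.summary)
```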


## geom_bar() function
A bar chart drawn with geom_bar() shows the frequency of each value (the number of records in the data set).
```{r}

graduates %>% 
ggplot() +
  geom_bar(mapping=aes(x=AcademicYear)) # you only need to provide the x axis
```

## Box Plot
The boxplot compactly displays the distribution of a continuous variable.
It visualises five summary statistics (the median, two hinges and two whiskers), and all "outlying" points individually.

```{r}
graduates %>% 
ggplot() +
  geom_point(mapping=aes(x=Sex, y=Headcount))

graduates %>% 
ggplot() +
  geom_boxplot(mapping=aes(x=LevelOfStudy, y=Headcount))
```
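
The five summary statistics can also be computed by hand, which makes it clearer what the box and whiskers stand for. This is a sketch in base R on a made-up sample (`toy.sample` is hypothetical); it assumes the default rule of geom_boxplot(), where whiskers reach at most 1.5 times the interquartile range beyond the hinges.

```{r}
# hypothetical sample with one unusually large value
toy.sample = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

toy.hinges = quantile(toy.sample, c(0.25, 0.75)) # lower and upper hinge (first and third quartile)
toy.median = median(toy.sample)
toy.iqr = IQR(toy.sample) # distance between the two hinges

# whiskers reach the most extreme observation within 1.5 * IQR of each hinge
lower.whisker = min(toy.sample[toy.sample >= toy.hinges[1] - 1.5 * toy.iqr])
upper.whisker = max(toy.sample[toy.sample <= toy.hinges[2] + 1.5 * toy.iqr])

# anything beyond the whiskers is drawn as an individual "outlying" point
toy.outliers = toy.sample[toy.sample < lower.whisker | toy.sample > upper.whisker]

print(c(lower.whisker, toy.hinges[1], toy.median, toy.hinges[2], upper.whisker))
print(toy.outliers)
```

Here the value 30 lies beyond the upper whisker, so a boxplot of this sample would draw it as a separate point.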


------------------------------------------------------------------------

# MAKE IT PRETTY
Use of title, label, background color and themes

```{r}
# in this example we save the plot to a variable name 'level.bar.plot' so that we can use it again and again.
level.bar.plot = graduates %>% 
filter(ProgrammeCategory=="Engineering and Technology") %>% 
ggplot() +
  geom_col(mapping=aes(x=AcademicYear, y=Headcount, fill=LevelOfStudy))

# To show the plot, just call print() with the previously saved plot variable as parameter.
print(level.bar.plot)
```


## Plot Background
**element_rect()** is a function that generates a rectangle element.
You specify the **fill** parameter as a string containing a color name or a hex code.

The plot background is the entire area of the plot, including the title, axis labels and legend.

```{r}
level.bar.plot # default style

level.bar.plot +
  theme(plot.background = element_rect(fill="orange")) # styling the plot background
```

## Panel Background
The panel background is the inner area of the plot where the data is drawn. The areas showing the title, axis labels and legend are NOT included.
```{r}
level.bar.plot # default style

level.bar.plot +
  theme(panel.background = element_rect(fill="orange")) # styling the panel background
```

## Remove Plot and Panel Background
In visual design, color is a very powerful tool to guide the audience's attention, but you have to use it carefully.

Too many colors usually do the opposite and confuse the audience.
Minimal design is the recent trend, especially now that many people use small devices such as mobile phones for day-to-day communication.

In this example, we remove both the plot and panel backgrounds to achieve a clean design. After all, the background is NOT the main dish, and a background color very often distracts from the graph.

```{r}
level.bar.plot # default style

level.bar.plot +
  theme(panel.background = element_blank()) + # styling the panel background to none
  theme(plot.background = element_blank()) + # styling the plot background to none
  theme(panel.grid.major.y = element_line(color="grey")) # styling the grid line for y-axis
```
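
Theme settings can be stored in a variable and reused, which avoids repeating the same theme() calls for every plot. This is a sketch: `clean.theme` and `demo.clean.plot` are hypothetical names, and the plot uses the built-in cars data set so the chunk runs on its own.

```{r}
library(ggplot2)

# hypothetical name: a theme object holding the clean-background settings
clean.theme = theme(
  panel.background = element_blank(),                  # no panel background
  plot.background = element_blank(),                   # no plot background
  panel.grid.major.y = element_line(color = "grey")    # grey grid lines for the y-axis
)

# any plot can reuse the stored settings with the + operator
demo.clean.plot = ggplot(cars) +
  geom_point(mapping = aes(x = speed, y = dist)) +
  clean.theme

print(demo.clean.plot)
```

The same object could be added to any other plot in the same way, for example `level.bar.plot + clean.theme`.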

## Change Label for x/y Axis
```{r}
level.bar.plot # default style

level.bar.plot +
  theme(panel.background = element_blank()) + # styling the panel background to none
  theme(plot.background = element_blank()) + # styling the plot background to none
  theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
  ylab("Number of Student") + # Label for Y axis
  xlab("Year") # Label for X axis
```

## Rotate the Label Text
```{r}
level.bar.plot # default style

level.bar.plot +
  theme(panel.background = element_blank()) + # styling the panel background to none
  theme(plot.background = element_blank()) + # styling the plot background to none
  theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
  ylab("Number of Student") + # Label for Y axis
  xlab("Year") + # Label for X axis
  theme(axis.text.x = element_text(angle = 45))
```



## Change Fill Colors
```{r}
level.bar.plot # default style

level.bar.plot +
  theme(panel.background = element_blank()) + # styling the panel background to none
  theme(plot.background = element_blank()) + # styling the plot background to none
  theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
  ylab("Number of Student") + # Label for Y axis
  xlab("Year") + # Label for X axis
  scale_fill_manual(values=c("purple", "orange", "blue", "tomato")) # use c() function to specify color list
```

## Styling The Legends
```{r}
level.bar.plot # default style

level.bar.plot +
  theme(panel.background = element_blank()) + # styling the panel background to none
  theme(plot.background = element_blank()) + # styling the plot background to none
  theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
  ylab("Number of Student") + # Label for Y axis
  xlab("Year") + # Label for X axis
  theme(legend.position="top") +
  scale_fill_manual(values=c("purple", "orange", "blue", "tomato"),
                    guide = guide_legend(title="Level of Study", 
                                         label.position = "bottom")
                    ) # move legend position to top and label position to bottom 
  
```

## Add Title and Subtitle
```{r}
level.bar.plot # default style

level.bar.plot +
  theme(panel.background = element_blank()) + # styling the panel background to none
  theme(plot.background = element_blank()) + # styling the plot background to none
  theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
  ylab("Number of Student") + # Label for Y axis
  xlab("Year") + # Label for X axis
  theme(legend.position="top") +
  scale_fill_manual(values=c("purple", "orange", "blue", "tomato"),
                    guide = guide_legend(title="Level of Study", 
                                         label.position = "bottom")
                    ) + # move legend position to top and label position to bottom 
  ggtitle("Hong Kong Higher Education Student Headcount", subtitle="2009 - 2019")
```

## Add Annotation Text
Add extra text or shapes to enhance your visualization.
```{r}
level.bar.plot # default style

level.bar.plot +
  theme(panel.background = element_blank()) + # styling the panel background to none
  theme(plot.background = element_blank()) + # styling the plot background to none
  theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
  ylab("Number of Student") + # Label for Y axis
  xlab("Year") + # Label for X axis
  theme(legend.position="top") +
  scale_fill_manual(values=c("purple", "orange", "blue", "tomato"),
                    guide = guide_legend(title="Level of Study", 
                                         label.position = "bottom")
                    ) + # move legend position to top and label position to bottom 
  ggtitle("Hong Kong Higher Education Student Headcount", subtitle="2009 - 2019") + 
  annotate("text", label="Record\nHigh", x="2017/18", y=5300) # you can change text position value of x and y to set the text position
```

## Adding Reference Lines
```{r}
level.bar.plot # default style

level.bar.plot +
  theme(panel.background = element_blank()) + # styling the panel background to none
  theme(plot.background = element_blank()) + # styling the plot background to none
  theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
  ylab("Number of Student") + # Label for Y axis
  xlab("Year") + # Label for X axis
  theme(legend.position="top") +
  scale_fill_manual(values=c("purple", "orange", "blue", "tomato"),
                    guide = guide_legend(title="Level of Study", 
                                         label.position = "bottom")
                    ) + # move legend position to top and label position to bottom 
  ggtitle("Hong Kong Higher Education Student Headcount", subtitle="2009 - 2019") + 
  annotate("text", label="Record\nHigh", x="2017/18", y=5300) + # you can change text position value of x and y to set the text position
  geom_hline(yintercept=3200) + # adds horizontal line
  geom_vline(xintercept = "2017/18", color="white") # adds vertical line
  
```


## Using Themes
```{r}
level.bar.plot # default style

level.bar.plot +
  theme_bw() # black and white theme

level.bar.plot +
  theme_minimal() # minimal theme

level.bar.plot +
  theme_dark() # dark theme
```

## More 3rd-party Themes
Install **ggthemes** package to unlock wider selections of themes.

```{r}
if (!require("pacman")) install.packages("pacman") # check if pacman already installed. If not, install it.
pacman::p_load(ggthemes)

level.bar.plot # default style

level.bar.plot +
  theme_excel() + # Excel Theme
  ggtitle("Excel Theme")
  

level.bar.plot +
  theme_wsj() + # Wall Street Journal Theme
  ggtitle("Wall Street Journal Theme")

level.bar.plot +
  theme_economist() + # Economist Theme
  ggtitle("Economist Theme")

level.bar.plot +
  theme_fivethirtyeight() + # Five Thirty Eight
  ggtitle("Five Thirty Eight Theme")

```


------------------------------------------------------------------------

# MORE RESOURCES ON ggplot2

## Official Website
```{r}
browseURL("https://ggplot2.tidyverse.org/")
```

## Extensions
```{r}
browseURL("https://exts.ggplot2.tidyverse.org/")
```


------------------------------------------------------------------------

# EXPLORING & ANALYZING WITH MODELS

![Workflow of Data Science](https://d33wubrfki0l68.cloudfront.net/e5bf2a8f4c787a12facbc0b4191fc82bd192f4c5/4e5d2/diagrams/data-science-model.png)

Data science combines programming, mathematics and domain expertise. Among these, mathematics is the foundation of models. With models, data scientists make predictions, discover hidden patterns and draw insights. 

Modeling is usually an iterative process that cycles through data transformation, data visualization, exploring with models and fitting.

## What exactly is a Model?

Humans are good at drawing conclusions and providing insight, but NOT good at directly facing a large number of data attributes and a huge volume of raw data.  

![Models Example](https://d33wubrfki0l68.cloudfront.net/e28a66adf6e8b2d127db8d3af9ac992a2abb87ce/47308/model-basics_files/figure-html/unnamed-chunk-45-1.png)

A model is a mathematical expression that provides a simple low-dimensional **summary** of a data set, so that we can draw conclusions and even provide insights. Models only provide an approximation (NOT the exact truth).


## Basic Concepts of Model

Let's do some simple R coding to uncover the basic concepts of a model.

```{r Loading Required Packages }
if (!require("pacman")) install.packages("pacman") # install pacman
pacman::p_load(pacman, tidyverse, modelr, magrittr) # install (or load) required packages
```


Let's use **sim1**, a simple simulated data set that ships with the modelr package, for exploring.
In this simulated data, a simple scatterplot already shows a strong pattern.

```{r Using sim1 Data Set}
print(sim1)
ggplot(sim1, aes(x, y)) +
  geom_point()
```

## Generating a Random Linear Model
A linear model is widely used to explore the relation between two variables.
A linear model is described as

`y = a1 + x * a2`

Let's generate a random value of a1 as the **intercept** and a2 as the **slope**.
Here, we use **runif()** to generate a uniformly distributed random number.
```{r}
model = tibble(
  a1 = runif(1, -20, 40), # random intercept value between -20 and 40
  a2 = runif(1, -5, 5) # random slope value between -5 and 5
)
print(model)

ggplot(sim1, aes(x,y)) +
  geom_point() +
  geom_abline(aes(intercept = a1, slope = a2), data=model, color="Orange")
```

## Generating 250 Random Models as Candidate Models
The number of potential models is unlimited. Let's generate 250 random ones as candidates.  
Among these 250 models, some are obviously bad at a glance. Others look reasonable, but we don't know which one is the best.

```{r}
models = tibble(
  a1 = runif(250, -20, 40), # 250 random intercept values between -20 and 40
  a2 = runif(250, -5, 5) # 250 random slope values between -5 and 5
)

ggplot(sim1, aes(x,y)) +
  geom_point() +
  geom_abline(aes(intercept = a1, slope = a2), data=models, alpha=0.2)
```
## Selecting the Most Fitting Ten Models
```{r}
# this function calculates the modeled y value for each observed x value
modeled_y = function(a, data) {
  a[1] + data$x * a[2] # a[1] is the intercept and a[2] is the slope
}

# this function measures how far a model is from the data: the root-mean-squared difference between observed and modeled y values
measure_distance = function(mod, data) {
  diff <- data$y - modeled_y(mod, data) # mod is random intercept and slope of a certain model
  sqrt(mean(diff ^ 2))
}

# this function calculates that distance on sim1 for a given model with a1 as intercept and a2 as slope
sim1_dist = function(a1, a2) {
  measure_distance(c(a1, a2), sim1) # a1 is the intercept of a model while a2 is the slope
}

# use map2_dbl (a mapping function) to add a new column named 'dist' holding the distance of each random model
models %<>% 
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = a1, slope = a2, color = -dist) , 
    data = filter(models, rank(dist) <= 10) # To show only the best 5, change 10 to 5
  )

```
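
The distance used above is a root-mean-squared difference, which is easy to check by hand on a tiny data set. This is a sketch: `toy.data` and `toy_distance()` are hypothetical names, and the data is made up so that it lies exactly on the line y = 0 + 2 * x.

```{r}
# hypothetical data lying exactly on the line y = 0 + 2 * x
toy.data = data.frame(x = c(1, 2, 3), y = c(2, 4, 6))

# root-mean-squared difference between observed and modeled y values
toy_distance = function(a, data) {
  diff = data$y - (a[1] + data$x * a[2]) # a[1] is the intercept and a[2] is the slope
  sqrt(mean(diff ^ 2))
}

perfect.fit = toy_distance(c(0, 2), toy.data) # the true line, so every difference is 0
shifted.fit = toy_distance(c(1, 2), toy.data) # every modeled y value is 1 too high

print(c(perfect.fit, shifted.fit))
```

A smaller distance means the line runs closer to the points, which is why the candidate models are ranked by this value.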

## Using lm() function
In fact, we didn't have to write all the previous code ourselves. 
R has a built-in function for this, named **lm()** (a linear model fitting function).
**lm()** finds the closest model in a single step, using a sophisticated algorithm that involves geometry, calculus, and linear algebra.
```{r}
sim1_auto_model = lm(y ~ x, data = sim1) # finding the optimized linear model
print(sim1_auto_model) # print out the auto generated linear model
print(summary(sim1_auto_model)) # print out the summary of the generated linear model

sim1_coef = coef(sim1_auto_model) # retrieves model's intercept and slope
print(sim1_coef)

# visualize the auto generated linear model on top of the sim1 data
ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    intercept = sim1_coef[1], slope = sim1_coef[2] # fixed values, so they are set outside aes()
  )

new.data = data.frame(x = c(1,2,3,4,5,6,7,8,9,10))
new.data
predict(sim1_auto_model, new.data)
```
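
For a straight line with a single predictor, the least-squares intercept and slope also have a simple closed form, which can serve as a check on what lm() reports. This is a sketch on made-up numbers (`toy.x`, `toy.y` and the other names are hypothetical); lm() itself solves the problem with a more general matrix method, so this is a cross-check and not a description of its internals.

```{r}
# hypothetical data for illustration
toy.x = c(1, 2, 3, 4, 5)
toy.y = c(2.1, 3.9, 6.2, 8.1, 9.8)

# closed-form least-squares estimates for a straight line
toy.slope = cov(toy.x, toy.y) / var(toy.x)
toy.intercept = mean(toy.y) - toy.slope * mean(toy.x)

# lm() should report the same two numbers
toy.fit = lm(toy.y ~ toy.x)

print(c(toy.intercept, toy.slope))
print(coef(toy.fit))
```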


## Categorical Variable

## Recoding Data

## Scaling

## Transforming Outliers

## More on Model


------------------------------------------------------------------------



## unnest()


